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Abstract 

Background: Accurate classification of patients with a complex disease into subtypes has important implications for 
medicine and healthcare. Using more homogeneous disease subtypes in genetic association analysis will facilitate the 
detection of new genetic variants that are not detectible using the non-differentiated disease phenotype. Subtype 
differentiation can also improve diagnostic classification, which can in turn inform clinical decision making and 
treatment matching. Currently, the most sophisticated methods for disease subtyping perform cluster analysis using 
patients' clinical features. Without guidance from genetic information, the resultant subtypes are likely to be 
suboptimal and efforts at genetic association may fail. 

Results: We propose a multi-view matrix decomposition approach that integrates clinical features with genetic 
markers to detect confirmatory evidence for a disease subtype. This approach groups patients into clusters that are 
consistent between the clinical and genetic dimensions of data; it simultaneously identifies the clinical features that 
define the subtype and the genotypes associated with the subtype. A simulation study validated the proposed 
approach, showing that it identified hypothesized subtypes and associated features. In comparison to the latest 
biclustering and multi-view data analytics using real-life disease data, the proposed approach identified clinical 
subtypes of a disease that differed from each other more significantly in the genetic markers, thus demonstrating the 
superior performance of the proposed approach. 

Conclusions: The proposed algorithm is an effective and superior alternative to the disease subtyping methods 
employed to date. Integration of phenotypic features with genetic markers in the subtyping analysis is a promising 
approach to identify concurrently disease subtypes and their genetic associations. 

Keywords: Genotype-phenotype association, Multi-view data analysis, Subtyping, Biclustering, Matrix decomposition 



Background 

For complex diseases, such as substance dependence or 
psychiatric disorders, a variety of clinical features that 
collectively indicate or characterize the disease pheno- 
type often vary substantially among individuals [1]. Stud- 
ies of genetic association or those that aim to match 
patients with certain treatments for a complex disease can 
be impeded by this phenotypic heterogeneity [2]. Case- 
control association studies based on a binary trait, such 
as the diagnosis of a disease, which partitions the popula- 
tion into cases (subjects with the disease) and non-cases 
(subjects without the disease), cannot differentiate the 
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heterogeneous manifestations of the disease. Although 
many candidate genes or genomic regions have been asso- 
ciated with complex diseases [3], the characteristics or 
subtypes of the disease for which the association exists 
remain to be specified. For instance, the specific addictive 
behaviors that underlie the associations with candidate 
genetic variants need to be elucidated to clarify the risk 
for addiction [4]. 

Classification of a complex disease into homogeneous 
subcategories or subtypes may help to identify the genetic 
variants contributing to the effect of the subphenotypes 
[5,6]. However, prior studies have been limited to unsu- 
pervised cluster analysis or latent class analysis on clinical 
features to derive subtypes. Genotypic data have only been 
used to evaluate the validity of subtypes, such as in subse- 
quent association tests with the derived subtypes, rather 
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than to guide the creation of the subtypes. Consequently, 
the resultant subtypes may be of limited utility in genetic 
association analysis. Integration of data from both clini- 
cal and genomic dimensions also offers opportunities to 
find confirmatory evidence of a subtype based on both its 
genetic and clinical features. A few studies have examined 
the joint use of gene expression and genotypic data for 
cancer subtyping [7,8], but they did not identify a variable 
subspace (or a subset of features) in each data source so as 
to group subjects consistently across the two subspaces. 
Hence, they could not detect genetic variants associated 
with the identified clusters. 

There has also been little research on this topic in 
the statistics literature. The most relevant area involves 
co-clustering [9] or multi-view data analysis [10], where 
samples are characterized or viewed in multiple ways, 
thus creating multiple sets of input variables. There are 
two types of co-clustering methods: (1) biclustering, also 
called two-mode clustering [11,12], which simultaneously 
clusters the rows and columns of a data matrix and (2) 
multi-view co-clustering [9,13], which seeks groupings 
that are consistent across different views. Biclustering is 
similar to another set of algorithms that search for sub- 
spaces and group subjects differently in each subspace 
[14]. 

Biclustering and subspace searching essentially identify 
different subgroups of subjects using different features (or 
markers), thus helping to identify genetic variants specific 
to a particular subgroup. However, this method can only 
be applied to one data matrix from a single view rather 
than data jointly from multiple views. Multi-view co- 
clustering, on the other hand, seeks a grouping of subjects 
that is consistent across different views (i.e., different sets 
of features), but the resultant clusters are defined using 
all of the available features, e.g., all of the studied genetic 
markers. Hence, it cannot be used to identify subtype- 
specific variants/features. Thus, to address our subtyping 
problem, we not only partitioned subjects in such a way 
that the subgroups differed in both clinical features and 
genetic markers, but also included a subspace search to 
identify the specific features or markers that defined the 
subgroups. 

In this paper, we propose a multi-view matrix decompo- 
sition approach based on the sparse singular value decom- 
position (SSVD) technique [12] to classify a complex 
disease into subtypes using data both from the clinical and 
genetic views. The objective of this problem is to identify 
subject clusters that agree in the clinical and genetic views, 
and simultaneously identify features and markers that are 
associated with the clusters. Employing the sparse SVD in 
our approach is critical to its success, especially in terms 
of successfully detecting associated variants given that the 
number of truely associated variants are much fewer than 
the number of single nucleotide polymorphisms (SNPs) 



in the whole genome. The proposed approach was vali- 
dated on synthetic datasets that were simulated to have 
subtype structures and several genetic markers associated 
with the subtypes and a real world clinical dataset that 
was aggregated from multiple genetic studies of substance 
dependence. We compared our approach to a bicluster- 
ing approach [12] and the latest multi-view data analytics 
methods [9]. The results clearly show that the perfor- 
mance of our approach is superior to that of all other 
available methods. 

Methods 

We start with a presentation of the notations that are 
used throughout the paper. A vector is denoted by a bold 
lower case letter as in v and ||v||^ represents its £^-norm, 
which is defined by ||v||^ = (\v a) f + • • • + \v {d )\ p ) 1/p , 
where V(# is the ;-th component of v and d is the length 
of v, i.e., the total number of components in v. We use 
||v||o to represent the so-called 0-norm of v that equals the 
number of non-zero components in v. Denote u © v the 
component- wise (Hadamard) products of u and v. The set 
Bd contains all binary vectors of length d. A binary vec- 
tor is a vector whose components equal either 0 or 1. A 
matrix is denoted by a bold upper case letter, e.g., M nx d is 
a n-by-d matrix, and ||M||p is its Frobenius norm defined 
by (tr(M T M)) 1 ^ 2 where tr(-) is the trace of a matrix. 
Rows and columns in M are denoted by M(/,.) and M( v ), 
respectively. 

Review of single-view biclustering 

We briefly review the biclustering method with a single 
view of data based on the sparse singular value decom- 
position [12]. For a single data matrix M of size n-by-d, a 
subgroup of its rows and a subgroup of its columns can be 
simultaneously obtained by the SSVD. The SSVD requires 
both the left and right singular vectors to be sparse. Let 
u of size n and v of size d be a pair of singular vectors 
resulting from the SSVD. Their outer product forms a 
sparse low-rank approximation of the original matrix, i.e., 
M = cruv T where o is the corresponding singular value. 
Then, the rows in M that correspond to non-zero compo- 
nents in u form a row subgroup. The columns in M that 
correspond to non-zero components in v form a column 
subgroup. The resultant row and column clusters help to 
define one another. The SSVD finds all singular vectors 
sequentially by repeatedly solving the following problem 
with a data matrix M: 

min ||M - cru\ T \\j + X M ||<xu|| 0 + A. v ||orv|| 0 

a,u,v 

subject to ||u|| 2 = 1, ||v|| 2 = 1. 

(1) 

The regularization terms ||<ru||o and ||<xv||o are used to 
enforce the sparsity of u and v. Note that the scalar a 
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will not affect the value of the regularization terms. The 
parameters X u and X v are two hyper-parameters to balance 
the approximation performance and the regularization 
terms. If both k u and k v equal 0, the optimal solution to 
this problem is the left and right singular vectors of M that 
correspond to its largest singular value. An alternating 
algorithm has been proposed in [12] to solve this problem 
effectively when X u and X v are not 0. This algorithm first 
initiates u and v by the first left and right singular vectors 
of M, then alternates between solving two sub-problems 
until it converges. The two sub-problems are: (a), fix u and 
find v that optimizes the objective of Eq.(l); (b), fix v and 
find u that optimizes the objective of Eq.(l). 

Assume that each row of M represents a subject and 
each column corresponds to a feature. Once a pair of 
vectors u and v is obtained, a subject (row) cluster as indi- 
cated by the non-zero components of u is obtained. At 
the same time, the features on which the subjects in the 
cluster show high similarity are also identified in a col- 
umn cluster as indicated by the non-zero components of 

v. More clusters can be obtained by repeating the opti- 
mization process with modified data matrices. To obtain 
subsequent clusters that are disjoint from any identified 
cluster in terms of subjects, the SSVD solves Eq.(l) using 
a new matrix M that excludes subjects (rows) already 
included in a row cluster. To obtain subsequent clusters 
that allow overlapping of subjects with identified clus- 
ters, the SSVD can solve Eq.(l) with the deflated M = 
M — auv r that removes the identified SVD components 
as used in the standard SVD. 

The proposed formula for two-view joint biclustering 

In this section, we extend the single-view SSVD to find a 
consistent grouping of subjects across two data matrices. 
In a later section, the resulting method will be extended to 
incorporate more than two data matrices. 

Assume that two data matrices denoted by Mi of size 
n-by-di and M2 of size n-by-^2 characterize the same set 
of n subjects from two different views. We can obtain ui, 

vi, and U2, V2 by a separate SSVD of Mi and M2, respec- 
tively. However, it will not guarantee that the row clusters 
specified by ui and 112 agree. To make them consistent, 
ui and U2 must have non-zero components at the same 
position. Note that the two u vectors are not necessarily 
the same, because they may be derived from very different 
features in the views, such as real-valued clinical features 
but discrete values in genetic markers. 

We propose to use a binary vector z of size n that 
serves as a common factor to link the two views. Each 
component of u is then multiplied by the corresponding 
component of z, i.e., U{ = U{Zi. In other words, we rep- 
resent each u vector by z O u in the objective function 
of SSVD to construct the sparse, rank one approximation 
matrices of Mi and M2, simultaneously. When z is sparse, 



both z O ui and z O u.2 will be sparse. Thus, we enforce 
the sparsity of z rather than individual u and solve the 
following optimization problem: 

min ||Mi- oi(z O ui)vf |||+ ||M 2 - cr 2 (z 0 u 2 )vj||| 

z,Gi,Ui,Vi,i=l,2 

+ k z \\z\\ 0 + A V1 ||CT1V1 ||o +^v 2 ll° r 2V2||o> 

subject to || Ui || 2 = 1, II 1| 2 = 1, * = 1,2, 
z e B n . 

(2) 

where X z , k Vl and k V2 are tuning parameters that bal- 
ance the approximation errors and regularization terms. 
Although the values of us are constrained to be unit vec- 
tors, the values of z O us are not necessarily unit vectors. 
However, a careful examination reveals that for any opti- 
mal solution u, we can find another optimal solution u 
that has non-zero values only at the entries indicated by 
the binary vector z, which ensures that z O u is also a unit 
vector. We first set U(j) = if zq) 7^ 0, or U(j) = 0 other- 
wise, for j = 1 • • • , n. We then update the corresponding 
singular value a = cr||u||2 and rescale u = u/||u||2. 
This new vector u satisfies the constraints of Eq.(2), and 
together with the new o will produce the same objective 
value as the original solution u, thus corresponding to an 
optimal solution as well. We design a fast algorithm in a 
later section to find such a sparse u for Eq.(2). 

We discuss two alternatives to the proposed formula (2). 
A restricted version of Eq.(2) may require ui = U2 = u 
and then replace z O ui and z 0 U2 by the same u in the 
objective function of Eq.(2), which leads to the following 
problem 

min ||Mi - criuvf ||* + ||M 2 - or 2 uvJ||J 

<7/,u,v/,i=l,2 

+ A M ||u||o + X n II 0\ Vi II 0 + A. V2 II cr 2 V 2 1| 0, 

subject to ||u|| 2 = 1, ||v/|| 2 = 1, i = 1, 2. 

(3) 

By requiring u to be sparse, it can also identify consistent 
row clusters between two views. The resultant optimiza- 
tion problem is easier to solve without integer variables 
in z. However, it is an unnecessarily stringent constraint 
to limit the search space to ui = U2, which rules out a 
number of potential solutions that may include the opti- 
mal row clusters. Another alternative is to minimize the 
difference between ui and U2, which suffers from the same 
over-constrained problem because the exact values of the 
difference are not involved. Our problem only seeks to 
identify the indicators of whether or not a component of 
u is zero. 

It is also useful to discuss the relation between Eq.(3) 
and the feature concatenation method, which simply 
merges the features from the two views in a cluster anal- 
ysis. The feature concatenation method finds a single set 
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of u and v for the data matrix [Mi M2] by solving the 
following problem 



min 

<7,VL,\ 



subject to 



|| [Mi M 2 ] -cruv T \\ 2 F + X u \\au\\ 0 + A v ||orv|| 0 

I|U|| 2 = 1, ||V|| 2 = 1. 

(4) 



where the v vector is of size d\ + d<i< In comparison with 
Eq.(3), Eq.(4) uses a single o for the two views, and the 
concatenated v is constrained to be a unit vector rather 
than individual vi and V2. It is easy to show that any opti- 
mal solution to Problem (3) can become a feasible solution 
to Problem (4) by properly rescaling vi and V2 and absorb- 
ing the scaling factors by o\ and 02 to make o\ — <X2, but 
is not necessarily an optimal solution to Problem (4). An 
optimal v to Problem (4) may have either vi or V2 be zero, 
which is however not allowed in Eq.(3). When one of the 
v vectors is zero, the resultant clusters differ only on one 
view of the features. As an example, we concatenated 64 
clinical features to 1248 SNPs in a disease sub typing anal- 
ysis. Because the genetic markers outweighed the clinical 
features, the resultant clusters differed significantly only 
on the SNPs, leading to disease subtypes that could not be 
clinically recognized. 

A fast algorithm for two-view joint biclustering 

The proposed formulation (2), although is a mixed-integer 
program, can be effectively solved after proper relax- 
ations. We design an alternating optimization algorithm 
to solve it by splitting the variables into three working sets: 
one set consists of the u vectors; one set consists of the v 
vectors; and the last set consists of the binary variables in 
z. We optimize the variables in one working set at a time 
in alternative steps. 

(1) Find the optimal m, vi, m, and V2 with fixed z 

When z is fixed, Problem (2) can be decomposed into 
two sub-problems that optimize with respect to each 
individual view. Without loss of generality, we show 
how to optimize ui and vi by solving the following 
sub-problem with a fixed z. 



min ||Mi -ai(zOm)vf ||| + X n \\a\y\ || 0 

01,111, vi 



subject to || mil 2 = 1, || vi || 2 = 1, 



(5) 



which can be solved by alternating between 
optimizing for u and for v. 

(a) Solve for\\ when m is fixed 

We solve the following equivalent problem 
for the optimal vi by relaxing the unit length 



constraint on vi, and then setting o\ = \\v1W2 
and vi = y\jo\. 



min ||Mi 

vi 



(zOm)vf|£ + A Vl ||vi||o. 



(6) 



Similar to the single-view SSVD, we relax the 
0-norm to have the i\ vector norm, and solve 



for v by minimizing ||Mi — (z O ui)v^ ||£+ 
X Vl || vi ||i. Each component in vi can be 
computed independently from the others by 
solving 



2a (j)Vi(j) +2/3|vi, (/) |, 



where = ufMi^y), and p = X Vl /2. This 
problem can be solved analytically by 
soft-thresholding [12]: 



v i(/) = 



if a(j) > 0, 
0, if |a w | <fi, j=l,...,d. 
ay) + p, if a(j) < -p, 



(7) 



(b) Solve for ui when vi is fixed 

After vi is obtained and fixed, we optimize 
Problem (5) with respect to o\ and m. We let 
m = <7iui, and solve the following problem 
to obtain m. By setting o\ = ||m H2 and 
ui = ui/(Ji, we obtain a solution to Problem 
(5). 



min 

ui 



||Mi 



(zGu^vflll. 



(8) 



Each component ui(/) in an optimal m can 
be independently and analytically computed 
as follows: 



U K0 



Mi (l -,.)Vi 



z (0 



ifz (0 #0 



1,- 



,n. 



0, if z (0 = 0. 



(9) 



(2) Find the optimal z with fixed m, vi, m, and V2 

When all values of us and v's are fixed in Problem 
(2), the optimization problem becomes: 

min ||Mi - m(z O m)vf \\ 2 F + ||M 2 

zeB n ,cri,(j2 

-a 2 (zOu 2 )vJ||2 +A z ||z|| 0 . 

(10) 

Denote the values of a/s from the previous iteration 
by &i and 02- We temporarily relax the binary z 
variables to be real-valued and then let z = aiz. 
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Again, we use the £i-norm of z to approximate its 
0-norm and solve the following problem for z: 



min ||Mi-(zOui)v 1 r |||+||M 2 



-(a 2 /a 1 )(zQu2)y^\\ 2 F + x z \\z\\ 1 

(11) 

The normalization step for z by o\ is used to contrast 
the different singular values for the different views so 
re-scaling z will not cause an issue. Note that 
Problem (11) can be rewritten as follows: 

min ||M-diag(z)E||f. + Az||z||i 

z 

where M =[ Mi M2] is obtained by concatenating 
the data matrices in columns, E =[ uivf (02/ 'o\) 
U2vJ], and diag(z) converts z into a diagonal matrix. 
Then, each component of an optimal z can be 
analytically computed as follows: 



Z(o = 



Yd) ~ 0, y (i) > 0 

0, \Y(i)\<0 
Y(i) + 0> Y(i) < -0 



i = 1, • 



, n. 



(12) 



where ^=%S and0= ^- Eq - (i2)is 

derived based on the same calculation in [12] which 
was used to derive Eq.(7). 

After obtaining z, the solution z to Problem (10) can 
be calculated as follows: 



z (0 = 



1, ifz (0 #0 
0, if z (0 = 0. 



i = 1, • • • , /z. 



(13) 



To preserve the same objective value of Problem (2) 
after updating z, we update ui and 112 as follows: 



u (0 = 



u (0 /z (0 , if z ( i) ^ 0, 
0, if z ( /) = 0, 



(14) 



and (7i, (72 are recalculated as: ai = || Hi || 2, 
(72 = (£2/01) II 1*2 II 2; then we normalize ui and U2 by 
ui =ui/||ui|| 2 , andu 2 = u 2 /||u 2 ||2. 
The proposed algorithm alternates between solving 
the three sub-problems (6), (8) and (10) until a local 
minimizer is reached. The overall objective is 
monotonically non-increasing when minimizing 
each sub-problem, so the convergence of this 
iterative process is guaranteed. When applied to both 
synthetic and real world datasets, this process 
reached a convergent point in about 10 iterations. To 
derive another row subgroup, we repeat the 



algorithm using new matrices Mi and M2 that either 
exclude the rows corresponding to the subjects in the 
identified subgroup or are deflated by subtracting the 
identified singular value components auv r . By 
repeating this procedure, the desired number of 
subject groups can be achieved. 

Extension to more than two views 

In some applications, more than two views of data can 
be available. For example, besides data on clinical features 
and genetic markers, gene expression data may also be 
used in the analysis. The optimization problem (2) can be 
readily extended to incorporate m separate data matrices, 
e.g., Mi, i = 1, •, m, as follows: 



min V ||M/ - or/(z O u;)vf \\ 2 F + A. z ||z|| 0 

z,<7i,Ui,Vi,i=l,...,m 

i—1 

m 

+ ^A v .||(7^||o, 
i=l 

subject to || u; II 2 = 1, ||v/||2 = 1, i = 1, • • - ,m, 
z e B n . 

This problem can be solved similarly by decomposing it 
into several sub-problems and solving each sub-problem 
in turn. We obtain the singular vectors of the data matrix 
in the view /, i.e., u; and v; while fixing z and other us and 
vs by optimizing: 

min ||M; - or/(z O u/)vf \\ 2 F + A. v Jcr/v/||o, 

Subject tO ||Ui||2 = l,||Vi||2 = l. 

Note that when z is fixed, the optimization of u/ and v/ 
is independent from one another among different views. 
Thus, these singular vectors can be computed in paral- 
lel, which can reduce the computation time significantly 
when more computational resources are available. When 
Ui and Vj are fixed for all views, we solve the following 
problem to obtain z and rescale z to obtain z: 



min V ||M/ - fo/aiXz O u,-)vf \\ 2 F + ^||z||i. 
i i=i 

Algorithm 1 summarizes all of the related steps to solve a 
multi-view SVD. Again, this algorithm can be repeated to 
obtain subsequent clusters in iterations. Although a good 
initialization can be problem-specific, we chose to initial- 
ize z with a vector of all ones, which assumes that all 
subjects have the potential to be in the cluster if no prior 
is given. 
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Algorithm 1 Multi-view Singular Value Decomposition 
Input: Mi, X z , X Vp i= 1, • • • , m 
Output: z, 07, u/, v/, / = 1, • • • , m 

1. Initialize z with a vector of all ones. 

2. Initialize u/'s by the corresponding left singular 
vectors of M/, / = 1, • • • , m. 

3. For i= 1, - • • ,m, 

Compute Vj by Eq.(7). 

Compute v/ from v/ and update a/. 

Compute u/ by Eq.(9). 

Compute u/ from u/ and update a/. 

4. Compute z by Eq.(12). 

5. Compute z from z by Eq.(13). 

6. Update or/, u/, i = 1, • • • , m by Eq.(14) accordingly. 
Repeat Steps 3 to 6 until z reaches a fixed point. 



Results and discussion 

We first validated the proposed method using synthetic 
data that were simulated with known cluster and associa- 
tion structures. We then evaluated our approach on a real 
world disease dataset aggregated from multiple genetic 
studies of cocaine dependence (CD). 

Normalized mutual information (NMI) was used to 
measure the agreement between any two cluster solutions. 
Denote two clusterings by and where each clus- 
tering contains a number of clusters as a partition of a 
given sample, and C; is a set containing indexes of the 
subjects in the i-th cluster. NMI computes the mutual 
information between the two clusterings normalized by 
the cluster entropies. In other words, 



NMI(C (1) ,C (2) ) = 



I(C^\C^) 



(H(C^)+H(CW))/2 



(15) 



ir (1) nr (2) i n\c (1) nr (2) \ 
where /(C<»,C«>) = £ v log ^j^, H(C) 

= l°g ir> an d \Ci\ denotes the number of sub- 

jects in the cluster C/. Because the true clusters are 
known in synthetic data, we computed NMI to measure 
the agreement between the true cluster assignments and 
the cluster assignments resulting from cluster analysis. A 
higher NMI value indicates better performance. 

In addition to NMI, for each clustering, classifiers were 
constructed based on genetic markers to separate subjects 
in different clusters. We used the Area Under the receiver 
operating characteristic Curve (AUC) [15] in a 10-fold 
cross-validation setting to measure the genetic separa- 
bility or homogeneity of the clusters in a clustering and 
compared it between different clusterings. We used a reg- 
ularized logistic regression [16] as the classification model 
in these experiments. 



We compared the proposed approach extensively 
against biclustering and multi-view analytics. We cal- 
culated NMI for different methods on synthetic data 
and AUC values on both synthetic and real world data. 
Our comparison study included the following existing 
methods: 

• Single-view SSVD: Clusters were included in the 
comparison by running the method of SSVD-based 
biclustering in the clinical view, as the biclustering 
method does not handle multiple views. Applying this 
method to genetic data created completely different 
clusters from those obtained in the clinical view. 

• Co-regularized spectral: This method was proposed 
previously [9] to find consistent row clusters across 
multiple views by applying spectral clustering to each 
view in turn together with a co-regularization factor 
applied to the cluster indicator vector. 

• Kernel addition: Radial basis function (RBF) kernels 
were calculated for each view and combined by 
summing them. Then spectral clustering was applied 
to the combined kernel to obtain row clusters. 

• Kernel product: This is the same procedure as in the 
kernel addition described above except that kernel 
matrices were combined by multiplying their 
components in the same position. 

• Feature concatenation: Data from the two views 
were combined by feature concatenation and a kernel 
matrix was computed based on the combined 
features. It was then used in spectral clustering to 
obtain row clusters. 

A simulation study 

Two disease subtypes, subtype 1 and subtype 2, were sim- 
ulated. Each of the subtypes was both defined by a set 
of phenotypic/clinical features and associated with a set 
of genetic markers. However, the clinical features and 
genetic markers differed for the two subtypes. Thus, each 
subtype corresponded to a cluster of subjects with the 
specific clinical features and the associated SNP markers 
(here we assumed that minor alleles at each locus were risk 
variants). The goal of the simulation was to create a ref- 
erence partitioning of subjects in both views (i.e., genetic 
markers and clinical features). 

Genetic data were obtained from the 1000 Genome 
Project [17], in which 1092 subjects were genotyped for 
several million genetic markers. We randomly selected 
1000 markers from chromosome 5 that had a minor allele 
frequency of at least 5% as genetic inputs in our experi- 
ments. Ten markers (different for each subtype) were ran- 
domly chosen to be associated with each subtype. Thus, 
a cluster of subjects was formed for each subtype, and we 
assigned subjects to a cluster if they had > 8 risk variants 
out of the 10 SNPs chosen for that subtype. This amounts 
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to an additive genetic model for each subtype (i.e., derived 
by adding the risk variants). Subjects who did not belong 
to either of the subtypes were treated as controls, forming 
the third subject cluster. We removed from the analysis 
subjects who belonged to both subtypes to ensure clarity 
in the partition. A total of 1013 subjects were retained. Of 
these, 247 and 167 were assigned to subtype 1 and subtype 
2, respectively, and 599 were controls. We named these 
clusters the genotypic clusters. 

We then created clusters of the same subjects in the 
clinical view to be consistent to a certain degree with 
the genotypic clusters. Note that many diseases, although 
highly heritable, are multifactorial genetically and envi- 
ronmentally. To reflect the environmental effects on the 
clinical features, we introduced random noise to the syn- 
thesized clinical data so that the clinical clusters were not 
exactly the same as the genotypic clusters, so as to test the 
robustness of the proposed approach. We used a param- 
eter e to indicate the relative effect that genetic variation 
contributed to the phenotypic variation. Denote the 
number of risk variants of subtype j shared by subject i, 
so 0 < < 10 according to our definition of genotypic 
clusters. If rj. * e + N(0, 1) > 7.5 * e, we assigned subject i 
to subtype j. This process created clusters of subjects that 
were different but similar to the genotypic clusters (with 
the parameter e reflecting the level of similarity). 

We named these clusters the phenotypic clusters 
because they were used to synthesize clinical features such 
that the clinical data represented these clusters. Similarly, 
we removed from the analysis subjects that overlapped in 
the two phenotypic clusters. Fewer than 15 subjects were 
excluded in any simulated dataset in the experiments. 
In addition to these two phenotypic clusters, two addi- 
tional phenotypic clusters, independent of any genetic 
variant and based on clinical features only, were created 
to make the simulated data more difficult but more real- 
istic. Each of the two additional clusters included 200 
subjects that were randomly selected among the controls. 
This design aimed to reflect the observation that multiple 
clinical clusters may exist in a sample, but only some clus- 
ters (two in our simulations) are associated with genetic 
factors. 

We simulated 10 binary phenotypic/clinical features 
that exhibited the phenotypic clusters. A subject was 
assigned a value of 0 or 1 for each of the features according 
to a pre-defined probability. Subtype 1 and subtype 2 each 
were associated with three features. Subjects in each sim- 
ulated phenotypic cluster were assigned a value of 1 with 
probabilities of 0.6, 0.5, and 0.4, respectively, for the three 
designated features. Each of the two additional phenotypic 
clusters was associated with two features, and subjects in 
each of the two subtypes were assigned a value of 1 in the 
two features, with probabilities of 0.6 and 0.5, respectively. 



A subject was assigned a value of 1 with a probability of 
0.1 on any other features. 

To evaluate how the proposed method performed when 
the genetic effect varied, four phenotypic datasets with 
e = 1, 0.8, 0.6, and 0.4 were generated and analysed. 
The genetic effect on phenotypic variation decreases with 
decreasing e, which leads to a lower level of agreement 
between genotypic and phenotypic clusters. 

All of the available methods were used to obtain 
three subject clusters. Table 1 provides the NMI calcu- 
lated by comparing subject clusters obtained from each 
approach to the simulated phenotypic clusters. The pro- 
posed method has the highest NMI on all four of the 
datasets. With decreasing e, the NMI obtained by the pro- 
posed method decreases gradually, as expected, but the 
subject clusters consistent between the two views can still 
be discerned. 

For each cluster solution, two classification models were 
built to separate subjects in each of the two subtypes 
from controls. The subject cluster from each method 
containing the largest number of controls was consid- 
ered the control group. The average AUC values and their 
interquantiles obtained by all compared approaches on 
each dataset are plotted in Figure 1. The proposed method 
achieved the second best performance on this mea- 
surement. Although the feature concatenation method 
obtained the clusters that were most separable genetically 
(i.e., with the best AUC), the clusters were not clinically 
recognizable. As shown in Table 1, they were the most 
disparate from the simulated true phenotypic clusters. 

A significant advantage of the proposed method is that 
it can simultaneously identify the features that specify the 
subject clusters. We calculated the number of features 
that were correctly and incorrectly identified by the pro- 
posed method to measure its performance in this regard. 
The results are summarized in Table 2, which shows that 



Table 1 Comparison of different methods on their cluster 
validity in the simulation 





e = 1 


e = 0.8 


e = 0.6 


e = 0.4 


Single-view SSVD 


0.0821 


0.1798 


0.2432 


0.2286 


Co-regularized Spectral 


0.2306 


0.2477 


0.2338 


0.2549 


Kernel addition 


0.2587 


0.2295 


0.2350 


0.2566 


Kernel product 


0.1917 


0.2432 


0.2302 


0.2310 


Feature concatenation 


0.1569 


0.1576 


0.1532 


0.1211 


Proposed method 


0.7949 


0.7693 


0.6815 


0.6329 



The normalized mutual information (NMI) values are shown, measuring the 
agreement between the clusters resulting from an approach and the simulated 
phenotypic clusters. The genetic contribution to the phenotypic variation varied 
according to different e values. A greater e value indicates a higher agreement 
between the simulated phenotypic clusters and genotypic clusters, making it 
easier for a clustering approach to recover the simulated phenotypic clusters. 
Italic fonts indicate the best performance in the experiments with each of the e 
values. 



Sun etal. BMC Genetics 2014, 15:73 
http://www.biomedcentral.eom/1 471-21 56/1 5/73 



Page 8 of 1 2 



e=1 



e=0.8 




Figure 1 Comparison of different methods on AUC values in the simulation. The box plot of AUC values obtained from all approaches in the 
comparison is shown for the simulated data. The methods were: A1 - the proposed method, A2 - single-view SSVD, A3 - co-regularized spectral 
clustering, A4 - kernel addition, A5 - kernel product, and A6 - feature concatenation. The parameter e reflects the level of genotypic effect to the 
phenotypic variation in the simulated data. The AUC values characterize the genetic separability of the clusters resulting from each method. 



our approach correctly identified all true associated fea- 
tures in both views with a very low false discovery rate 
(~ 15/1000) when taking into account the total number 
of features used in the analysis. 

A disease study: cocaine use and related behaviors 

A total of 1,474 African Americans were phenotyped 
and genotyped for genetic studies of cocaine depen- 
dence (CD) [18]. Subjects were recruited from the Yale 
University School of Medicine, University of Connecti- 
cut Health Center, University of Pennsylvania School of 

Table 2 The features identified by the proposed method in 
both views in the simulation 



Phenotypic view Genotypic view 







TF 


TPF 


FPF TF 


TPF 


FPF 




e= 1 




3 


1 


10 


4 




e = 0.8 




3 


1 


10 


5 


Subtype 1 




3 




10 








e = 0.6 




3 


2 


10 


15 




e = 0.4 




3 


0 


10 


10 




e = 1 




3 


0 


10 


4 




e = 0.8 




3 


0 


10 


4 


Subtype 2 




3 




10 








e = 0.6 




3 


0 


10 


2 




e = 0.4 




3 


0 


10 


5 



The parameter e reflects the level of genotypic effect to the phenotypic variation 
in the simulated data. TF is the number of True Features used in the simulation 
to specify a subject cluster. TPF (True Positive Features) is the number of features 
correctly identified. FPF (False Positive Features) is the number of features 
incorrectly identified. 



Medicine, McLean Hospital and Medical University of 
South Carolina. All subjects gave written, informed con- 
sent to participate, using procedures approved by the 
institutional review board at each participating site. Sub- 
jects were phenotyped using a computer-assisted inter- 
view, called the Semi-Structured Assessment for Drug 
Dependence and Alcoholism (SSADDA) [19], a polydi- 
agnostic instrument that was used to generate diagnoses 
of dependence on cocaine and other substances. Sixty- 
four yes-or-no variables were generated by this survey, 
which were also used in previous genetic association stud- 
ies [1,20,21]. These variables were used as the phenotypic 
features. Of the 1,474 subjects, 1,287 were diagnosed with 
cocaine dependence. Subjects were genotyped for 1,350 
SNPs selected from 130 candidate genes [4] and 186 
ancestry informative markers (AIMs) using the Illumina 
GoldenGate Assay platform (Illumina, Inc., San Diego, 
CA). 

The original dataset aggregated from two studies was 
preprocessed with a sequence of steps for data cleaning 
and to address population stratification. Race was clas- 
sified using STRUCTURE v2.3 [22] and AIMs, which 
stratified the study subjects into two population groups: 
African Americans (A As) and European Americans (EAs). 
The AA group was used in the present analysis. Of the 
1,474 AAs, 93.78% had AA as their self-reported race. We 
excluded other population groups from the analysis. Prin- 
cipal components analysis (PCA) was performed on the 
186 AIMs for the stratified AA population. The first PCA 
dimension was used in the subsequent association tests as 
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Figure 2 Comparison of different methods on AUC values in the 
CD study. The box plot of AUC values were obtained from all 
methods on the data of cocaine use and related behaviors. A1 - the 
proposed method, A2 - single-view SSVD, A3 - co-regularized spectral 
clustering, A4 - kernel addition, A5 - kernel product. 



a covariate to correct for the residual population structure. 
SNPs for which data were available for less than 95% of 
the subjects, or for which the P value for Hardy- Weinberg 
equilibrium was less than 10 -7 , were excluded from our 
analysis. The minor allele frequency (MAF) of each SNP 
was calculated within this AA population group. SNPs 
with a MAF < 1% were removed. The remaining 1,248 
SNPs were used as the genetic markers in the multi-view 
biclustering experiment. The SNPs selected by the pro- 
posed Algorithm 1 were then used in the association test 
that was based on the logistic regression model. 



The feature concatenation method overlooked the infor- 
mation in the clinical or phenotypic view as observed in 
both the simulation study and the case study. Thus, we 
excluded the feature concatenation method from further 
comparisons. Three subject clusters were obtained from 
each of the methods in the comparison. Logistic regres- 
sion models were built with sex, age and the first PCA 
dimension as covariates and tested in a manner similar to 
that used for synthetic data. Figure 2 shows the box plot 
of the AUC values. As shown there, our approach signifi- 
cantly outperformed all other methods with respect to the 
genetic separability of the resultant clusters. A paired t- 
test to compare the AUC values from our method with 
each of the other methods yielded a ^-values < 0.05 for all 
comparisons. 

For the proposed method, the three identified subject 
clusters contained 795 (Group 7), 295 (Group 2) and 384 
(Group 3) subjects. Group 1 and Group 2 were identi- 
fied consecutively, and Group 3 contained the remaining 
subjects. Group 3 contained more than 80% of the con- 
trol subjects; thus, we used this group as a control group 
in our association analysis. The number of clinical fea- 
tures identified as associated with Group 1 and Group 2 
were 18 and 17, respectively. Figures 3 and 4 compare 
the three subject clusters on the percentage of positive 
responses to the identified clinical features. A few identi- 
fied features are not shown in the figures, because they are 
highly correlated (r > 0.7) with the features shown. 

From these two figures, we can see that Group 1 is dis- 
tinctively associated with several withdrawal symptoms, 
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I Felt depressed 1 

I Had trouble concentrating 1 

I Felt depressed 2 

] Felt restless 2 

] Felt tired 2 

] Had trouble sleeping 2 

] Desire for cocaine 2 

I Felt slowed down 2 

I Daily functioning was interfered 1 




IHl IHIHHI 
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Patient clusters 

Figure 3 Comparison among the three cocaine user groups on the features identified for Group 7. Cocaine use symptoms are identified by 
the superscript 1 , and the symptoms due to stopping, cutting down or going without cocaine are identified by the superscript 2 . The percentage of 
individuals endorsing any of the features are reported for each user group. 



Sun etal. BMC Genetics 2014, 15:73 
http://www.biomedcentral.eom/1 471-21 56/1 5/73 



Page 10 of 12 



0.8 



c 

o 

Q. 



g 0.6 



o 

Q. 

M— 

O 
CD 

cn 

03 

+—> 
c 

CD 
U 
i— 

CD 
CL 



0.4 



0.2 



Often used on more days or in larger amount 
Spent a great deal of time with cocaine 
Stayed high for a whole day or more 
Had a paranoid experience 
Decreased contact with friends or family 
Could not stop or cut down on cocaine 
Objection from others because of cocaine use 
Being treated for a problem with cocaine 
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Figure 4 Comparison among the three cocaine user groups on the features identified for Group 2. The percentage of individuals endorsing 
any of the features are reported for each user group. 



such as feeling depressed, restless, or tired when the 
subject stopped, cut down or went without cocaine. When 
Group 2, the second row cluster, was identified, the cor- 
responding column cluster contained 17 clinical features. 
We plotted the percentage of positive responses to eight of 
these features for all three cocaine user groups in Figure 4. 
Subjects in both Group 2 and Group 1 showed high val- 
ues on these features. Note that subjects in Group 1 were 
excluded when the second cluster was derived. From these 
observations, we can conclude that Group 1 is a heavy user 
group with many negative consequences of cocaine use, 
Group 2 is a moderate cocaine user group, and Group 3 is 
a low cocaine user group. 

There were 114 and 237 genetic markers identified for 
Group 1 and Group 2, respectively, by Algorithm 1. Based 
on these markers, two logistic regression models were 
built to identify the markers that had the highest pre- 
dictive power in distinguishing subjects in Group 1 or in 
Group 2, from those in the control group. Table 3 gives the 
5 SNPs that received the largest magnitude of weights in 
the models. It is interesting to note that the HTR2C gene 
was significantly associated with Group 1 in our study (p- 
value < 10 -5 ), having previously been identified with a 
heavy use, early-onset and high comorbidity subtype of 
cocaine dependence [20]. 

Conclusion 

It is challenging to identify the genetic causes of com- 
plex disorders such as substance dependence, due to 
their heterogeneous clinical manifestations and complex 
genetic etiologies, which include gene x environment 



interactions. Phenotype refinement that leads to homo- 
geneous subtypes is a promising approach to solve 
this problem [1,5,23-25]. However, most of the meth- 
ods used to refine phenotypes take into consideration 
only the phenotypic information, despite the availability 
of genotypic information in genetic studies of a com- 
plex disorder. Thus, existing approaches have had lim- 
ited success in finding a phenotypic subtype that is 
genetically homogeneous. In this paper, we propose a 



Table 3 Top five SNPs associated with each of the two CD 
subtypes 





SNP 


Chr 


MAF 


HWE 


Gene 




rs6318 


chrX 


0.3643 


1.00 


HTR2C 


Group 1 


rs2427400 


chr20 


0.1280 


0.22 


NTSR1 


vs. 


rs460401 


chr21 


0.3500 


0.18 


GRIK1 


Group 3 


rs1 0485058 


chr06 


0.0585 


0.38 


OPRM1 




rs2279423 


chr15 


0.0237 


0.81 


CHRM5 




rs897692 


chrl 1 


0.3972 


0.86 


HTR3A 


Group 2 


rs9996854 


chr04 


0.5436 


0.61 


GABRB1 


vs. 


rs481036 


chrOI 


0.5582 


0.21 


CHRM3 


Group 3 


rs6092933 


chr20 


0.2070 


0.17 


SLC32A1 




rs9371781 


chr06 


0.3687 


0.49 


OPRM1 



The five SNP markers that received the largest magnitude of weights in the two 
classification models that separate the subtype cases, in Group 7 and Group 2, 
respectively, from the controls in Group 3. The SNP name, the SNP location 
(chromosome i.e., Chr), the name of the gene (Gene), the minor allele frequency 
(MAF) and the P-value for Hardy Weinberg equilibrium (HWE) are provided for 
each SNP. 
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multi-view biclustering approach to refine the phenotype 
by jointly taking into account genetic and phenotypic 
information. 

The proposed method is distinct from existing multi- 
view data analytics in that the relevant features can 
be identified at the same time that a subtype is deter- 
mined, which is critical to its success. This increases the 
likelihood of finding genetic associations. The proposed 
method is distinct from existing biclustering methods 
in that it harmonizes the subject groupings in two or 
more views. The developed algorithm is highly scalable 
with large datasets because at each iteration it calcu- 
lates closed-form solutions for different groups of working 
variables. The results from extensive experimental com- 
parisons on both synthetic data and real world datasets 
demonstrate the effectiveness and superior performance 
of the proposed approach. 

This study has a number of limitations. The proposed 
multi-view biclustering method, in its current form, does 
not simultaneously handle population stratification and 
phenotype-genotype association. It may spuriously iden- 
tify markers that are relevant to a disease subtype due 
to population structure rather than being truly associ- 
ated with the specific disease. Thus, population groups 
need to be stratified in additional steps such as those per- 
formed in our experiments. It is desirable to extend our 
method to address the three-way relationship among pop- 
ulation subgroups, genotypes and phentoypes to ensure 
the validity of the identified phenotype-genotype asso- 
ciations. Further, the proposed method was used in our 
empirical study to identify the first two major sub- 
groups of subjects, for which no invalid clusters caused 
by random noise were identified. When larger numbers 
of clusters are to be identified, the two methods we 
designed to find subsequent clusters (by either exclud- 
ing subjects in the identified subgroups or deflating sin- 
gular value components from the data matrix) become 
susceptible to the detection of invalid clusters because 
singular values will decrease in subsequent decomposi- 
tion. Empirical studies may be needed to examine more 
thoroughly the signal-to-noise pattern of the proposed 
method. 
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